Introduction

This challenge uses aggregate summaries of runs scored per over for the first innings of Indian Premier League matches to predict runs earned.

Question 1

Derive a variable runs earned that is the total runs scored from any combination of overs_remaining or wickets_remaining.

ipl <- read_csv("ipl.csv")

Part a

The following function derives the dependent variable of runs earned for this analysis.

REarned <- function(x) { #Right
x <- x %>%
dplyr::group_by(id) %>%
dplyr::mutate(Total = sum(runs_scored),runs_earned = Total + runs_scored - cumsum(runs_scored),Total = NULL)
}


ipl <- REarned(ipl)

Part b

Identify if there are any outliers in runs_earned and describe what you find.

Initial plots examine the shapes of the variables and possible distributions of the runs earned variable.

par(mfrow=c(1,1))

hist(ipl$runs_earned,main="Histogram of Runs Earned",xlab = "Runs Earned")

#Some outliers in right tail
ipl$year <- format(ipl$date, format = "%Y")

#table(ipl$runs_earned)
#table(ipl$year)
#table(ipl$id)
#summary(ipl$runs_earned)


#plot(ipl$overs_remaining, ipl$runs_earned)

qqnorm(ipl$runs_earned, pch = 1, frame = FALSE) #Not Normal
qqline(ipl$runs_earned, col = "steelblue", lwd = 2)

#Potential distributions
descdist(ipl$runs_earned, discrete=FALSE, boot=500)

## summary statistics
## ------
## min:  0   max:  252 
## median:  86 
## mean:  87.20415 
## estimated sd:  48.29643 
## estimated skewness:  0.2089738 
## estimated kurtosis:  2.246711
descdist(ipl$runs_earned, discrete=TRUE, boot=500)

## summary statistics
## ------
## min:  0   max:  252 
## median:  86 
## mean:  87.20415 
## estimated sd:  48.29643 
## estimated skewness:  0.2089738 
## estimated kurtosis:  2.246711
#Rosner Test Outliers detected?
test <- rosnerTest(ipl$runs_earned,
  k = round(0.01*nrow(ipl)))

test$distribution
## [1] "Normal"
test$n.outliers
## [1] 0

When examining the plots of the overs remaining and wickets remaining as they directly relate to runs earned there appears to be outliers for each value of overs remaining. There is also obvious heteroskedasticity that will be explored further on in the analysis.

ggplot(ipl, aes(overs_remaining,runs_earned))+ 
  geom_point()+
  geom_smooth(method="lm", se=F)+
  labs(title="Runs Earned by Overs Remaining",
        x ="Overs Remaining", y = "Runs Earned")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(ipl, aes(wickets_remaining,runs_earned))+ 
  geom_point()+
  geom_smooth(method="lm", se=F)+
  labs(title="Runs Earned by Wickets Remaining",
        x ="Wickets Remaining", y = "Runs Earned")
## `geom_smooth()` using formula = 'y ~ x'

In order to deal with the outliers identified for each of the overs remaining all observations below the 0.01 and over the 0.99 quantiles were removed from the analysis. The plot below shows data without the identified outliers.

#hist(subset(ipl$runs_earned,ipl$overs_remaining==16))


ipl <- ipl %>%
dplyr::group_by(overs_remaining) %>%
dplyr::mutate(LOWQ = ifelse(runs_earned < quantile(runs_earned,.01), "Yes" , "No"),UPPQ = ifelse(runs_earned > quantile(runs_earned,.99), "Yes" , "No"))


table(ipl$LOWQ)
## 
##    No   Yes 
## 18080   186
ggplot(subset(ipl,ipl$LOWQ == 'No' & ipl$UPPQ == 'No')
       , aes(overs_remaining,runs_earned))+ 
  geom_point()+
  geom_smooth(method="lm", se=F)+
  labs(title="Runs Earned by Overs Remaining",
       subtitle = "Outliers Removed",
        x ="Overs Remaining", y = "Runs Earned")
## `geom_smooth()` using formula = 'y ~ x'

Part c

The outliers identified in were removed from the analysis leaving a total of 17891 observations.

ipl <- subset(ipl,ipl$LOWQ == 'No' & ipl$UPPQ == 'No' )

Question 2

Create a visualization to show how runs earned varies with overs and wickets remaining. Interpret the relationship you observe.

The plots in Question 1 indicate that there is a positive linear correlation between runs earned and both overs and wickets remaining. This observation makes sense as teams with more overs and wickets remaining have more opportunities to score runs.

The three dimensional interactive plot below further supports the assumption of a positive correlation. Furthermore, there appears to be heteroskedasticity with a larger variability present when teams have more overs and wickets remaining. The data will have to be transformed, or any models built to estimate the average expected runs will need to be able to deal with this heteroskedasticity.

fig <- plot_ly(ipl, y = ~wickets_remaining, z = ~runs_earned, x = ~overs_remaining, color = ~runs_earned)
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Overs Remaining'),
                     yaxis = list(title = 'Wickets Remaining'),
                     zaxis = list(title = 'Runs Earned')))

fig
Batting Team

The next two plots split the data by the batting team to examine any potential impacts of who is batting on the relationship between overs remaining or wickets remaining and earned runs.

fig1 <- plot_ly(ipl,  y = ~runs_earned, x = ~overs_remaining, color = ~batting_team)
fig1 <- fig1 %>% add_markers()
fig1 <- fig1 %>% layout(title = 'Runs Earned by Overs Remaining Colored by Batting Team',xaxis = list(title = 'Overs Remaining'),
                     yaxis = list(title = 'Runs Earned'))

fig1

The reader is able to select which batting teams they wish to include in the above plot and compare the shapes and spreads of the runs earned by overs remaining. For example, when only selecting the Chennai Super Kings and the Pune Warriors it appears, although not significantly, that the Chennai Super Kings earn more runs on average than the Pune Warriors. Furthermore, the variability of runs earned is much larger for the Chennai Super Kings.

fig2 <- plot_ly(ipl,  y = ~runs_earned, x = ~wickets_remaining, color = ~batting_team)
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(title = 'Runs Earned by Wickets Remaining Colored by Batting Team',
                        xaxis = list(title = 'Wickets Remaining'),
                     yaxis = list(title = 'Runs Earned'))

fig2

The above plot allows the reader to again select which batting teams they wish to compare with regards to runs earned and wickets remaining. Following the above example and comparing the Chennai Super Kings and the Pune Warriors, it appears that the slope is very similar and the variability of the Chennai Super Kings is larger.

Bowling Team

The following plots allow the reader to select which bowling teams they wish to compare with regards to runs earned by overs remaining and wickets remaining.

fig3 <- plot_ly(ipl,  y = ~runs_earned, x = ~overs_remaining, color = ~bowling_team)
fig3 <- fig3 %>% add_markers()
fig3 <- fig3 %>% layout(title = 'Runs Earned by Overs Remaining Colored by Bowling Team',
                        xaxis = list(title = 'Overs Remaining'),
                     yaxis = list(title = 'Runs Earned'))

fig3
fig2 <- plot_ly(ipl,  y = ~runs_earned, x = ~wickets_remaining, color = ~bowling_team)
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(title = 'Runs Earned by Wickets Remaining Colored by Bowling Team',
                        xaxis = list(title = 'Wickets Remaining'),
                     yaxis = list(title = 'Runs Earned'))

fig2
Interactions

The next plot attempts to identify any interaction between wickets remaining and overs remaining as it may relate to runs earned.

ipl$WR <- factor(ipl$wickets_remaining)

ggplot(ipl, aes(overs_remaining,runs_earned))+ 
  geom_point()+
  ylim(0, 255)+
  geom_smooth(method="lm", se=T)+
  facet_wrap(~WR, scale="free")+
  labs(title="Runs Earned by Overs Remaining across Wickets Remaining (Interactions)",
        x ="Overs Remaining", y = "Runs Earned")
## `geom_smooth()` using formula = 'y ~ x'

This interaction plot above shows that teams who have many wickets remaining and few overs remaining earn more runs. In general, it appears that there is some kind of interaction between wickets remaining and overs remaining with regards to explaining runs earned.

Question 3

Build a model to estimate the average expected runs that can be earned from any possible combination of overs and wickets remaining. Write the mathematical description of your model and justify your choice. Show code to fit the model and evaluate it’s performance.

Model Construction and Description

From Question 2, the runs earned variable appears to be normally distributed. With that in mind, a simple linear regression may be estimated. Models are estimated as,

\(runs\_earned\) = \(\alpha\) + \(\beta_1overs\_remaining\) +\(\beta_2wickets\_remaining\)+\(\beta_3overs\_remaining*wickets\_remaining\)+\(\delta_ibatting\_team\) +\(\delta_jbowling\_team\)

where \(i\) and \(j\) represent the team batting and bowling respectively. A second linear regression model with the overs remaining treated as a factor was also estimated. Wickets remaining was left as a continuous variable as the different levels were not significant as factors.

lm1 <- lm(runs_earned~overs_remaining+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,data=ipl)
#summary(lm1)

lm1%>%
  tidy() %>%
  kable(digits = 3,col.names = c("Variable", "Estimate", "S.E.", "Stat.", "p-value"),
        align = c("l", "c", "c", "c", "c"),caption = 'OLS Regression Estimates')%>%
    kable_classic_2(full_width = F, position = "left")%>%
    footnote(general = "Adjusted R-Squared: 0.833",
    general_title = "Note.",
    footnote_as_chunk = TRUE)
OLS Regression Estimates
Variable Estimate S.E. Stat. p-value
(Intercept) -7.727 1.229 -6.286 0.000
overs_remaining 5.520 0.139 39.787 0.000
wickets_remaining 3.980 0.156 25.520 0.000
batting_teamDeccan Chargers -4.689 0.797 -5.883 0.000
batting_teamDelhi Capitals -2.637 0.967 -2.727 0.006
batting_teamDelhi Daredevils -3.070 0.687 -4.469 0.000
batting_teamGujarat Lions -2.897 1.262 -2.295 0.022
batting_teamGujarat Titans 0.056 2.226 0.025 0.980
batting_teamKings XI Punjab -2.740 0.627 -4.374 0.000
batting_teamKochi Tuskers Kerala -17.124 1.706 -10.038 0.000
batting_teamKolkata Knight Riders -3.145 0.619 -5.084 0.000
batting_teamLucknow Super Giants 4.170 1.882 2.216 0.027
batting_teamMumbai Indians 0.508 0.588 0.865 0.387
batting_teamPune Warriors -10.591 1.071 -9.890 0.000
batting_teamPunjab Kings -5.737 1.194 -4.806 0.000
batting_teamRajasthan Royals -5.045 0.644 -7.839 0.000
batting_teamRising Pune Supergiant -2.695 1.606 -1.678 0.093
batting_teamRising Pune Supergiants -9.499 1.751 -5.424 0.000
batting_teamRoyal Challengers Bangalore 0.566 0.615 0.920 0.357
batting_teamSunrisers Hyderabad -3.739 0.668 -5.599 0.000
bowling_teamDeccan Chargers 1.932 0.907 2.131 0.033
bowling_teamDelhi Capitals 4.039 0.908 4.450 0.000
bowling_teamDelhi Daredevils 1.526 0.661 2.310 0.021
bowling_teamGujarat Lions 3.766 1.241 3.035 0.002
bowling_teamGujarat Titans 12.554 2.046 6.136 0.000
bowling_teamKings XI Punjab 3.110 0.656 4.743 0.000
bowling_teamKochi Tuskers Kerala -9.504 1.749 -5.435 0.000
bowling_teamKolkata Knight Riders -2.820 0.619 -4.557 0.000
bowling_teamLucknow Super Giants 5.444 2.235 2.436 0.015
bowling_teamMumbai Indians -1.661 0.639 -2.601 0.009
bowling_teamPune Warriors -3.948 0.997 -3.960 0.000
bowling_teamPunjab Kings -2.744 1.730 -1.586 0.113
bowling_teamRajasthan Royals 1.157 0.644 1.797 0.072
bowling_teamRising Pune Supergiant 5.517 1.738 3.175 0.002
bowling_teamRising Pune Supergiants -3.551 1.713 -2.073 0.038
bowling_teamRoyal Challengers Bangalore 0.704 0.629 1.119 0.263
bowling_teamSunrisers Hyderabad -2.126 0.702 -3.030 0.002
overs_remaining:wickets_remaining 0.088 0.015 5.734 0.000
Note. Adjusted R-Squared: 0.833
lm2 <- lm(runs_earned~factor(overs_remaining)+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,data=ipl)
#summary(lm2)

lm2%>%
  tidy() %>%
  kable(digits = 3,col.names = c("Variable", "Estimate", "S.E.", "Stat.", "p-value"),
        align = c("l", "c", "c", "c", "c"),caption = 'OLS Regression Estimates with Overs Remaining as a Factor')%>%
    kable_classic_2(full_width = F, position = "left")%>%
    footnote(general = "Adjusted R-Squared: 0.0.837",
    general_title = "Note.",
    footnote_as_chunk = TRUE)
OLS Regression Estimates with Overs Remaining as a Factor
Variable Estimate S.E. Stat. p-value
(Intercept) 6.675 1.326 5.033 0.000
factor(overs_remaining)2 6.799 0.926 7.340 0.000
factor(overs_remaining)3 12.369 0.937 13.201 0.000
factor(overs_remaining)4 17.205 0.967 17.797 0.000
factor(overs_remaining)5 21.375 1.017 21.021 0.000
factor(overs_remaining)6 25.345 1.090 23.259 0.000
factor(overs_remaining)7 28.653 1.189 24.096 0.000
factor(overs_remaining)8 31.425 1.310 23.980 0.000
factor(overs_remaining)9 33.888 1.453 23.316 0.000
factor(overs_remaining)10 35.886 1.615 22.220 0.000
factor(overs_remaining)11 37.842 1.792 21.122 0.000
factor(overs_remaining)12 39.234 1.992 19.701 0.000
factor(overs_remaining)13 40.479 2.199 18.407 0.000
factor(overs_remaining)14 40.898 2.422 16.888 0.000
factor(overs_remaining)15 42.326 2.668 15.865 0.000
factor(overs_remaining)16 43.576 2.926 14.893 0.000
factor(overs_remaining)17 44.158 3.201 13.793 0.000
factor(overs_remaining)18 44.636 3.486 12.805 0.000
factor(overs_remaining)19 44.036 3.776 11.663 0.000
factor(overs_remaining)20 42.511 4.065 10.458 0.000
wickets_remaining 0.800 0.216 3.697 0.000
batting_teamDeccan Chargers -4.799 0.788 -6.093 0.000
batting_teamDelhi Capitals -2.886 0.955 -3.021 0.003
batting_teamDelhi Daredevils -3.029 0.679 -4.464 0.000
batting_teamGujarat Lions -3.123 1.247 -2.504 0.012
batting_teamGujarat Titans 0.672 2.200 0.305 0.760
batting_teamKings XI Punjab -3.071 0.619 -4.960 0.000
batting_teamKochi Tuskers Kerala -17.713 1.686 -10.507 0.000
batting_teamKolkata Knight Riders -3.243 0.611 -5.306 0.000
batting_teamLucknow Super Giants 4.293 1.859 2.309 0.021
batting_teamMumbai Indians 0.264 0.581 0.455 0.649
batting_teamPune Warriors -10.744 1.058 -10.154 0.000
batting_teamPunjab Kings -6.267 1.180 -5.312 0.000
batting_teamRajasthan Royals -5.411 0.636 -8.506 0.000
batting_teamRising Pune Supergiant -2.851 1.587 -1.797 0.072
batting_teamRising Pune Supergiants -9.578 1.730 -5.535 0.000
batting_teamRoyal Challengers Bangalore 0.513 0.607 0.845 0.398
batting_teamSunrisers Hyderabad -3.847 0.660 -5.830 0.000
bowling_teamDeccan Chargers 2.065 0.896 2.305 0.021
bowling_teamDelhi Capitals 4.008 0.897 4.469 0.000
bowling_teamDelhi Daredevils 1.555 0.653 2.383 0.017
bowling_teamGujarat Lions 3.990 1.226 3.253 0.001
bowling_teamGujarat Titans 12.916 2.022 6.389 0.000
bowling_teamKings XI Punjab 3.106 0.648 4.794 0.000
bowling_teamKochi Tuskers Kerala -9.250 1.728 -5.353 0.000
bowling_teamKolkata Knight Riders -2.922 0.612 -4.778 0.000
bowling_teamLucknow Super Giants 5.762 2.208 2.609 0.009
bowling_teamMumbai Indians -1.566 0.631 -2.481 0.013
bowling_teamPune Warriors -3.975 0.985 -4.035 0.000
bowling_teamPunjab Kings -3.166 1.709 -1.852 0.064
bowling_teamRajasthan Royals 1.265 0.637 1.987 0.047
bowling_teamRising Pune Supergiant 5.169 1.717 3.010 0.003
bowling_teamRising Pune Supergiants -3.492 1.693 -2.063 0.039
bowling_teamRoyal Challengers Bangalore 0.668 0.622 1.074 0.283
bowling_teamSunrisers Hyderabad -2.279 0.693 -3.287 0.001
wickets_remaining:overs_remaining 0.497 0.025 20.017 0.000
Note. Adjusted R-Squared: 0.0.837
Linear Regression Assumptions
par(mfrow=c(2,2))
#OLS Model 1

plot(lm1)

#OLS Model 2
plot(lm2)

Both models appear to be very similar. When checking some of the basic diagnostics it appears that while the plot of the residuals and fitted values fluctuate around zero, the variance is not constant. Furthermore, there appears to be some non-normality of the residuals in the tails. This may be addressed by transforming variables or adjusting which variables are included in the model. In this case, more flexible models will be used to address these potential issues.

Generalized Linear Models

Generalized linear models allow for more flexibility than linear regression models. Furthermore, as the task is modelling expected runs earned count specific models may be estimated within this class of models.

Poisson count models may be used when the mean and the variance of the count variable is the same. This strong assumption was tested.

P1 =glm(runs_earned~overs_remaining+wickets_remaining+
          overs_remaining:wickets_remaining,family = poisson(link="log"),
        data=ipl)
P1%>%
  tidy() %>%
  kable(digits = 3,col.names = c("Variable", "Estimate", "S.E.", "Stat.", "p-value"),
        align = c("l", "c", "c", "c", "c"),caption = 'Poisson Regression Estimates')%>%
    kable_classic_2(full_width = F, position = "left")
Poisson Regression Estimates
Variable Estimate S.E. Stat. p-value
(Intercept) 1.801 0.010 185.629 0
overs_remaining 0.197 0.001 211.635 0
wickets_remaining 0.225 0.001 174.854 0
overs_remaining:wickets_remaining -0.015 0.000 -141.197 0
#Cameron & Trivedi (1990) Dispersion test
dispersiontest(P1,trafo=1) #trafo = transformation function - linear specification
## 
##  Overdispersion test
## 
## data:  P1
## z = 79.1, p-value < 2.2e-16
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 3.922712
dispersiontest(P1,trafo=2) #trafo = transformation function - quadratic specification
## 
##  Overdispersion test
## 
## data:  P1
## z = 68.147, p-value < 2.2e-16
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##     alpha 
## 0.0359022

Overdispersion is significantly present and the more flexible negative binomial distribution is assumed.

Model:

count(\(runs\_earned\)) = exp[\(\alpha\) + \(\beta_1overs\_remaining\) +\(\beta_2wickets\_remaining\)+\(\beta_3overs\_remaining*wickets\_remaining\)+\(\delta_jbowling\_team\)+\(\gamma_ibatting_team\)]

where \(j\) and \(i\) represent the index for each team.

NB <- glm.nb(runs_earned~overs_remaining+wickets_remaining+ overs_remaining:wickets_remaining+bowling_team+ batting_team,link=“log”, data=subset(train.data,train.data$Top_1M == “No”))

NB1 <- glm.nb(runs_earned~overs_remaining+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,link="log",data=ipl)

NB1%>%
  tidy() %>%
  kable(digits = 3,col.names = c("Variable", "Estimate", "S.E.", "Stat.", "p-value"),
        align = c("l", "c", "c", "c", "c"),caption = 'Negative Binomial Regression Estimates')%>%
    kable_classic_2(full_width = F, position = "left")
Negative Binomial Regression Estimates
Variable Estimate S.E. Stat. p-value
(Intercept) 1.721 0.019 89.616 0.000
overs_remaining 0.227 0.002 113.027 0.000
wickets_remaining 0.226 0.002 92.796 0.000
batting_teamDeccan Chargers -0.052 0.011 -4.702 0.000
batting_teamDelhi Capitals -0.020 0.013 -1.521 0.128
batting_teamDelhi Daredevils -0.030 0.009 -3.191 0.001
batting_teamGujarat Lions -0.029 0.017 -1.663 0.096
batting_teamGujarat Titans -0.033 0.031 -1.077 0.281
batting_teamKings XI Punjab -0.028 0.009 -3.211 0.001
batting_teamKochi Tuskers Kerala -0.209 0.024 -8.692 0.000
batting_teamKolkata Knight Riders -0.024 0.008 -2.825 0.005
batting_teamLucknow Super Giants 0.039 0.026 1.511 0.131
batting_teamMumbai Indians 0.024 0.008 2.956 0.003
batting_teamPune Warriors -0.117 0.015 -7.876 0.000
batting_teamPunjab Kings -0.064 0.017 -3.890 0.000
batting_teamRajasthan Royals -0.062 0.009 -6.996 0.000
batting_teamRising Pune Supergiant -0.027 0.022 -1.244 0.214
batting_teamRising Pune Supergiants -0.129 0.024 -5.357 0.000
batting_teamRoyal Challengers Bangalore 0.011 0.008 1.295 0.195
batting_teamSunrisers Hyderabad -0.044 0.009 -4.804 0.000
bowling_teamDeccan Chargers 0.030 0.012 2.383 0.017
bowling_teamDelhi Capitals 0.050 0.012 4.038 0.000
bowling_teamDelhi Daredevils 0.013 0.009 1.419 0.156
bowling_teamGujarat Lions 0.024 0.017 1.407 0.159
bowling_teamGujarat Titans 0.128 0.028 4.591 0.000
bowling_teamKings XI Punjab 0.022 0.009 2.481 0.013
bowling_teamKochi Tuskers Kerala -0.148 0.024 -6.082 0.000
bowling_teamKolkata Knight Riders -0.037 0.009 -4.360 0.000
bowling_teamLucknow Super Giants 0.069 0.031 2.246 0.025
bowling_teamMumbai Indians -0.030 0.009 -3.435 0.001
bowling_teamPune Warriors -0.045 0.014 -3.281 0.001
bowling_teamPunjab Kings -0.033 0.024 -1.404 0.160
bowling_teamRajasthan Royals 0.008 0.009 0.850 0.395
bowling_teamRising Pune Supergiant 0.072 0.024 3.015 0.003
bowling_teamRising Pune Supergiants -0.009 0.024 -0.393 0.694
bowling_teamRoyal Challengers Bangalore 0.003 0.009 0.371 0.710
bowling_teamSunrisers Hyderabad -0.034 0.010 -3.484 0.000
overs_remaining:wickets_remaining -0.017 0.000 -76.579 0.000
NB2 <- glm.nb(runs_earned~factor(overs_remaining)+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,link="log",data=ipl)


NB2%>%
  tidy() %>%
  kable(digits = 3,col.names = c("Variable", "Estimate", "S.E.", "Stat.", "p-value"),
        align = c("l", "c", "c", "c", "c"),caption = 'Negative Binomial Regression Estimates with Overs Remaining as a Factor')%>%
    kable_classic_2(full_width = F, position = "left")
Negative Binomial Regression Estimates with Overs Remaining as a Factor
Variable Estimate S.E. Stat. p-value
(Intercept) 1.996 0.020 97.422 0.000
factor(overs_remaining)2 0.629 0.016 39.383 0.000
factor(overs_remaining)3 0.965 0.016 62.209 0.000
factor(overs_remaining)4 1.197 0.016 77.106 0.000
factor(overs_remaining)5 1.370 0.016 86.411 0.000
factor(overs_remaining)6 1.513 0.016 91.816 0.000
factor(overs_remaining)7 1.629 0.017 93.323 0.000
factor(overs_remaining)8 1.726 0.019 92.123 0.000
factor(overs_remaining)9 1.811 0.020 89.155 0.000
factor(overs_remaining)10 1.884 0.022 85.028 0.000
factor(overs_remaining)11 1.954 0.024 80.626 0.000
factor(overs_remaining)12 2.014 0.027 75.690 0.000
factor(overs_remaining)13 2.070 0.029 71.080 0.000
factor(overs_remaining)14 2.117 0.032 66.465 0.000
factor(overs_remaining)15 2.170 0.035 62.250 0.000
factor(overs_remaining)16 2.221 0.038 58.351 0.000
factor(overs_remaining)17 2.265 0.041 54.575 0.000
factor(overs_remaining)18 2.307 0.045 51.192 0.000
factor(overs_remaining)19 2.340 0.049 48.042 0.000
factor(overs_remaining)20 2.368 0.052 45.200 0.000
wickets_remaining 0.080 0.003 25.766 0.000
batting_teamDeccan Chargers -0.055 0.009 -5.809 0.000
batting_teamDelhi Capitals -0.029 0.012 -2.537 0.011
batting_teamDelhi Daredevils -0.029 0.008 -3.496 0.000
batting_teamGujarat Lions -0.034 0.015 -2.225 0.026
batting_teamGujarat Titans -0.007 0.026 -0.256 0.798
batting_teamKings XI Punjab -0.039 0.007 -5.192 0.000
batting_teamKochi Tuskers Kerala -0.223 0.021 -10.689 0.000
batting_teamKolkata Knight Riders -0.030 0.007 -4.027 0.000
batting_teamLucknow Super Giants 0.044 0.022 1.992 0.046
batting_teamMumbai Indians 0.014 0.007 2.041 0.041
batting_teamPune Warriors -0.123 0.013 -9.587 0.000
batting_teamPunjab Kings -0.083 0.014 -5.791 0.000
batting_teamRajasthan Royals -0.073 0.008 -9.475 0.000
batting_teamRising Pune Supergiant -0.029 0.019 -1.545 0.122
batting_teamRising Pune Supergiants -0.125 0.021 -5.995 0.000
batting_teamRoyal Challengers Bangalore 0.010 0.007 1.392 0.164
batting_teamSunrisers Hyderabad -0.047 0.008 -5.962 0.000
bowling_teamDeccan Chargers 0.035 0.011 3.217 0.001
bowling_teamDelhi Capitals 0.049 0.011 4.591 0.000
bowling_teamDelhi Daredevils 0.015 0.008 1.952 0.051
bowling_teamGujarat Lions 0.036 0.015 2.434 0.015
bowling_teamGujarat Titans 0.148 0.024 6.178 0.000
bowling_teamKings XI Punjab 0.025 0.008 3.177 0.001
bowling_teamKochi Tuskers Kerala -0.133 0.021 -6.295 0.000
bowling_teamKolkata Knight Riders -0.038 0.007 -5.199 0.000
bowling_teamLucknow Super Giants 0.076 0.026 2.894 0.004
bowling_teamMumbai Indians -0.024 0.008 -3.172 0.002
bowling_teamPune Warriors -0.043 0.012 -3.643 0.000
bowling_teamPunjab Kings -0.046 0.021 -2.237 0.025
bowling_teamRajasthan Royals 0.013 0.008 1.750 0.080
bowling_teamRising Pune Supergiant 0.059 0.021 2.852 0.004
bowling_teamRising Pune Supergiants -0.016 0.020 -0.782 0.434
bowling_teamRoyal Challengers Bangalore 0.004 0.007 0.478 0.633
bowling_teamSunrisers Hyderabad -0.035 0.008 -4.198 0.000
wickets_remaining:overs_remaining 0.000 0.000 -1.533 0.125

As the analysis is done from the point of view of the batting team the random intercepts portion of this analysis is chosen to be the batting team. Each team may have a different intercept for their batting abilities as it relates to runs earned.

Model:

count(\(runs\_earned\)) = exp[\(\alpha\) + \(\alpha_ibatting\_team\) + \(\beta_1overs\_remaining\) +\(\beta_2wickets\_remaining\)+\(\beta_3overs\_remaining*wickets\_remaining\)+\(\delta_jbowling\_team\)]

where \(\alpha\) represents the fixed intercept, \(\alpha_i\) represents the random intercept for team \(i\), and \(j\) represents the index for each bowling team.

MENB <- glmer.nb(runs_earned~overs_remaining+wickets_remaining+
                 overs_remaining:wickets_remaining+bowling_team+ (1| batting_team),data=ipl,nAGQ=0)
tab_model(MENB)
  runs_earned
Predictors Incidence Rate Ratios CI p
(Intercept) 5.35 5.13 – 5.59 <0.001
overs remaining 1.26 1.25 – 1.26 <0.001
wickets remaining 1.25 1.25 – 1.26 <0.001
bowling team [Deccan
Chargers]
1.03 1.01 – 1.06 0.017
bowling team [Delhi
Capitals]
1.05 1.03 – 1.08 <0.001
bowling team [Delhi
Daredevils]
1.01 1.00 – 1.03 0.159
bowling team [Gujarat
Lions]
1.02 0.99 – 1.06 0.193
bowling team [Gujarat
Titans]
1.14 1.08 – 1.20 <0.001
bowling team [Kings XI
Punjab]
1.02 1.00 – 1.04 0.016
bowling team [Kochi
Tuskers Kerala]
0.86 0.82 – 0.90 <0.001
bowling team [Kolkata
Knight Riders]
0.96 0.95 – 0.98 <0.001
bowling team [Lucknow
Super Giants]
1.07 1.01 – 1.14 0.024
bowling team [Mumbai
Indians]
0.97 0.95 – 0.99 0.001
bowling team [Pune
Warriors]
0.95 0.93 – 0.98 0.001
bowling team [Punjab
Kings]
0.97 0.93 – 1.02 0.191
bowling team [Rajasthan
Royals]
1.01 0.99 – 1.03 0.406
bowling team [Rising Pune
Supergiant]
1.07 1.03 – 1.13 0.002
bowling team [Rising Pune
Supergiants]
0.99 0.95 – 1.04 0.698
bowling team [Royal
Challengers Bangalore]
1.00 0.99 – 1.02 0.764
bowling team [Sunrisers
Hyderabad]
0.97 0.95 – 0.99 0.001
overs remaining × wickets
remaining
0.98 0.98 – 0.98 <0.001
Random Effects
σ2 0.07
τ00 batting_team 0.00
ICC 0.04
N batting_team 18
Observations 17891
Marginal R2 / Conditional R2 0.862 / 0.868

The estimated second source of variance from the batting team is very small. It does not appear that adding this random term in the model improves anything.

Random Forest

The random forest algorithm combines the output of multiple decision trees to reach a single result. It is more computationally demanding than the simple regression models presented earlier but it may be able to better predict runs earned. In this challenge, 500 models are tested per random forest call.

RF1 <- randomForest(runs_earned~factor(overs_remaining)+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,
            data=ipl, proximity=TRUE)

print(RF1)
## 
## Call:
##  randomForest(formula = runs_earned ~ factor(overs_remaining) +      wickets_remaining + overs_remaining:wickets_remaining + batting_team +      bowling_team, data = ipl, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 374.7035
##                     % Var explained: 83.4

Cross Validation

In order to demonstrate the quality of the results ten-fold cross validation was conducted on each of the models focusing on prediction accuracy. Two measures of accuracy were used. The mean absolute error (MAE), and root mean squared error (RMSE) were used. Smaller values for mean absolute error (MAE) and root mean squared error (RMSE) are desired.

set.seed(14)

CVIPL <- ipl 
#shuffle
CVIPL<-CVIPL[sample(nrow(CVIPL)),] 
#Create 10 equally size folds
folds <- cut(seq(1,nrow(CVIPL)),breaks=10,labels=FALSE)

Prediction.Capability <- data.frame(matrix(ncol = 4, nrow = 0 )) #Dataframe for results
x <- c("Model","Test_Number", "MAE","RMSE") #col names
colnames(Prediction.Capability) <- x
rm(x)
PCLM <- Prediction.Capability
PCPM <- Prediction.Capability
PCNB <- Prediction.Capability
PCNB.ME <- Prediction.Capability
PCRF <- Prediction.Capability


#Perform 10 fold cross validation
for(i in 1:10){
  #Segement your data by fold using the which() function 
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- CVIPL[testIndexes, ]
  trainData <- na.omit(CVIPL[-testIndexes, ])
  
  
  lm <- lm(runs_earned~factor(overs_remaining)+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,
            data=trainData)
  
  P <- glm(runs_earned~overs_remaining+wickets_remaining+
          overs_remaining:wickets_remaining,family = poisson(link="log"),
          data=trainData)
  
  NB2 <- glm.nb(runs_earned~factor(overs_remaining)+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,
            link="log",data=trainData)
  
  
  NB.ME<- glmer.nb(runs_earned~overs_remaining+wickets_remaining+
                   overs_remaining:wickets_remaining+bowling_team+
                   (1|batting_team),data=trainData,nAGQ=0)
  
  
  RF <- randomForest(runs_earned~factor(overs_remaining)+wickets_remaining+
            overs_remaining:wickets_remaining+batting_team+bowling_team,
            data=trainData, proximity=TRUE)
  
  
  
  
  #OLS#
  # Make predictions and compute MAE, and RMSE #
  predictions <- lm %>% predict(testData, type = "response")
  predictions <- as.data.frame(predictions)
  names(predictions)[1] <- 'Pred'
  mae <- MAE(predictions$Pred, as.numeric(testData$runs_earned))
  rmse <- RMSE(predictions$Pred, as.numeric(testData$runs_earned))

  #assign(paste("logit4_5",i, sep=""), logit4_5) #Save each model
  PCLM[1,] <- c("LM",i,mae,rmse)
  
  
  #Poisson#
  # Make predictions and compute MAE, and RMSE #
  predictions <- P %>% predict(testData, type = "response")
  predictions <- as.data.frame(predictions)
  names(predictions)[1] <- 'Pred'
  mae <- MAE(exp(predictions$Pred), as.numeric(testData$runs_earned)) #Back to original scale
  rmse <- RMSE(exp(predictions$Pred), as.numeric(testData$runs_earned))

  #assign(paste("logit4_5",i, sep=""), logit4_5) #Save each model
  PCPM[1,] <- c("Poisson",i,mae,rmse)
  

  
  #NB#
  # Make predictions and compute MAE, and RMSE #
  predictions <- NB2 %>% predict(testData, type = "response")
  predictions <- as.data.frame(predictions)
  names(predictions)[1] <- 'Pred'
  mae <-  MAE(exp(predictions$Pred), as.numeric(testData$runs_earned)) #Back to original scale
  rmse <- RMSE(exp(predictions$Pred), as.numeric(testData$runs_earned))

  #assign(paste("logit4_5",i, sep=""), logit4_5) #Save each model
  PCNB[1,] <- c("NB",i,mae,rmse)
  
  
  
  
  #M.E. NB#
  # Make predictions and compute MAE, and RMSE #
  predictions <- NB.ME %>% predict(testData, type = "response")
  predictions <- as.data.frame(predictions)
  names(predictions)[1] <- 'Pred'
  mae <- MAE(predictions$Pred, as.numeric(testData$runs_earned))
  rmse <- RMSE(predictions$Pred, as.numeric(testData$runs_earned))

  #assign(paste("logit4_5",i, sep=""), logit4_5) #Save each model
  PCNB.ME[1,] <- c("NB.ME",i,mae,rmse)
  
  
  
  #RF#
  # Make predictions and compute MAE, and RMSE #
  predictions <- RF %>% predict(testData, type = "response")
  predictions <- as.data.frame(predictions)
  names(predictions)[1] <- 'Pred'
  mae <- MAE(predictions$Pred, as.numeric(testData$runs_earned))
  rmse <- RMSE(predictions$Pred, as.numeric(testData$runs_earned))

  #assign(paste("logit4_5",i, sep=""), logit4_5) #Save each model
  PCRF[1,] <- c("RF",i,mae,rmse)
  
  
  

  
  Prediction.Capability <- rbind(Prediction.Capability,PCLM,PCPM,PCNB,PCNB.ME,PCRF)
}

Prediction.Capability <- Prediction.Capability[order(Prediction.Capability$Model),]

The negative binomial count model performs the best when looking at the MAE and second best using RMSE. The linear regression model, although poorly specified, also does a good job at predicting runs earned compared to the other models. The mixed effects negative binomial and Poisson regression models perform the worst in this case.

Prediction.Capability <-  Prediction.Capability %>% 
  mutate_at(vars( MAE, RMSE), as.numeric)


aggregate(Prediction.Capability[, 3:4], list(Prediction.Capability$Model), mean)

Overall, including additional variables or terms reflecting the individual players could go a long way in improving these models. It is clear that wickets remaining and overs remaining impact the number of runs teams earn.

Resources